1. INTRODUCTION

Machine learning [1]-[4] is one of the most active areas of research today, characterised by the design
and development of algorithms that allow machines to learn [5], [6]. It is a subfield of artificial
intelligence in which the learning process consists of automatically extracting rules and patterns from
data [7], [8]. Machine learning is closely related to fields such as data mining, statistics, and pattern
recognition [9]-[11]. Supervised machine learning algorithms apply what has been learned from past examples
to new data in order to predict future events [12]-[15]. These algorithms analyse a known training dataset
and produce a function that predicts output values, so that after adequate training the system can provide
targets for any new input [16]-[18]. Furthermore, machine learning algorithms can compare their calculated
outputs with the correct outputs to find errors, and the model can then be modified accordingly [19]-[22].
One of the most classical machine learning techniques used for prediction is the random forest [23]-[25].
This technique is flexible and straightforward to use for prediction [26]. The forest consists of trees, and
it is generally held that the more trees there are, the more robust the forest is. In other words, the
random forest generates decision trees from randomly selected data samples [27], [28]. A prediction is then
obtained from each tree, and the final result is chosen through voting [29]. In general, this technique is
employed for both classification and regression [30], [31].

One of the most critical issues that databases face is the existence of null values [32], as organisations
rely heavily on the collection, storage, and analysis of data for decision-making purposes. In short, a
null value is an empty field, which means that the value is missing or unknown. Databases are sets of
columns and rows that hold data [33], but some fields will contain a null or missing value [34]. Moreover,
handling such values is not effortless, as it may take considerable time to detect them and locate where
they occur [35]. As a result, databases suffer significantly from the problem of empty data, which leads to
inaccurate records and incorrect calculations. This in turn forces a return to the traditional manual
method of data entry, so that managing the database requires great effort and time and the resulting data
are unreliable.

The main contribution of this work is a set of modifications to the random forest algorithm for imputing
null values in five datasets gathered from the University of California Irvine (UCI) machine learning
repository. The modification consists of three main elements that are improved within the algorithm:
bootstrap with less redundancy, added feature selection, and a modified ranking stage. This work also
compares the modified algorithm with the unmodified algorithm to assess the performance of the two
approaches in estimating null values.


2. LITERATURE SURVEY 

This section reviews literature on the random forest technique and related methods for handling null or
missing values in large datasets. Sadiq et al. [36] proposed using swarm intelligence and iterative
dichotomiser 3 (ID3) techniques to solve the problem of null values in a large set of data. The swarm
intelligence algorithm, represented by the bees algorithm, is used for feature selection, while ID3 is used
to obtain the statistical results. The study compares these two approaches for estimating null values; the
outcomes indicate that ID3 performs best, producing results without loss of null-value accuracy regardless
of how many null values were present. Sadiq and Chawishly [37] developed and improved the performance of the
ID3 algorithm to solve the problem of null values in a large dataset. They concluded that when one or two
null values occur in a row, the proposed system can estimate 99% of the null values, and when three null
values appear in a row, the estimate is 97%, which are efficient and sound results. Ramosaj and Pauly [38]
investigated several techniques (stochastic gradient tree boosting, the C5.0 algorithm, and random forest)
for predicting missing values in credit information and Facebook data. The authors developed these
techniques to work more efficiently and analysed their performance on continuous, categorical, and mixed
data. They concluded that the random forest performed best, as it found the missing values with high
accuracy in less time.

Salman et al. [39] presented a random forest algorithm whose performance is improved by the meerkat clan
algorithm in order to impute missing values. After 100 iterations, the performance and accuracy of the
random forest in calculating these values are good, but at 200 and 300 iterations the execution becomes
more complex. Increasing the block size in the modified algorithm improves the accuracy of null-value
computation. The paper considers both categorical and numeric null values, which makes the work more
effective. Jackins et al. [40] suggested applying artificial intelligence techniques (naive Bayes and
random forest) to predict diabetes, heart disease, and breast cancer. The database for this investigation
is taken from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), and all
patients' data are from females aged at least 21 years. In their experiments, missing values were replaced
with null values. The results demonstrate that the techniques can handle missing values and classify the
data efficiently. Gok and Olgun [41] used blood samples collected from patients at the Einstein Hospital in
Brazil to predict the level of severity of COVID-19 with machine learning algorithms (decision tree, random
forest, k-nearest neighbour, support vector machine classifier, gradient boosting, Gaussian naive Bayes,
multi-layer perceptron, and Gaussian process). Missing data affected the work; among several possible
approaches, the authors filled the missing values with the most common value. This study obtained an
accuracy of 0.98 with the random forest classifier.


3. METHOD 
Random forest is a supervised algorithm [42]-[45]. As its name suggests, it builds a forest of randomised
trees: it creates multiple decision trees and combines them to obtain a more accurate and stable prediction.
In general, the more trees in the forest, the greater the predictive power of the algorithm. The algorithm
is adaptable and easy to use [46]-[48]; even without parameter tuning, it produces good results most of the
time. In addition, many recent studies have applied this technique; for instance, it has been used to
analyse X-ray images of COVID-19 patients, among many other applications [49]-[52]. The algorithm is widely
employed for data analysis and prediction owing to its simplicity [53]-[55]. Figure 1 illustrates the steps
for creating a random forest classifier [56]. Randomisation is added to the model as the trees grow: when
splitting a node, the best feature is chosen from a random subset of the features instead of searching for
the most important feature overall. This produces greater diversity among the trees and generally a better
model.


[Figure 1: the training set is passed through bootstrap sampling to produce training sets 1 to n]

Figure 1. Random forest steps


Moreover, the algorithm considers a random set of features when partitioning nodes. Trees can be made even
more random by additionally using random thresholds for each feature rather than searching for the best
possible thresholds, as a standard decision tree does. As mentioned earlier, a random forest is a collection
of decision trees [57], [58], but there are several differences between the two. If a training dataset with
features and labels is fed into a decision tree, the tree formulates a set of rules that are then used to
make predictions. For instance, to predict whether a person on a social networking site will click on a
specific advertisement, information is gathered about the advertisement, about the people who clicked on it
in the past, and about some features that describe their decisions. If these features are put into a
decision tree, rules are derived to predict whether the advertisement will be clicked. The random forest, in
contrast, selects observations and features randomly to build many decision trees and then averages the
results. When decision trees are too deep, they can suffer from overfitting. Random forests avoid
overfitting most of the time by making random subsets of features, building smaller trees from these
subsets, and then merging the sub-trees. This procedure slows down computation, depending on how many trees
the forest generates.
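
The two sources of randomness described above (random observations and random features) and the final vote
can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation
used in this work: a one-feature threshold "stump" stands in for a full decision tree, and the names
`train_stump`, `random_forest_fit` and `random_forest_predict` are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(rows, labels, features):
    """Pick the (feature, threshold) split among `features` with the fewest training errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def stump_predict(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

def random_forest_fit(rows, labels, n_trees=10, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Random observations: bootstrap sample drawn with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random features: each tree only sees a random subset of the columns.
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def random_forest_predict(forest, row):
    """Classical (unweighted) majority vote over the trees."""
    return majority([stump_predict(s, row) for s in forest])
```

Replacing the stump with a deeper tree changes only `train_stump` and `stump_predict`; the bootstrap, the
feature subset and the vote stay the same.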

Several important modifications to different aspects of the random forest algorithm have been proposed. In
[59], the random forest was modified by adding double feature selection to filter the relevant features.
Fornaser et al. [60] proposed a modified random forest algorithm called Sigma-z, which addresses two points:
the lack of any metrological characterisation of the inputs passed to the model, such as the uncertainty of
the data, and the lack of an assessment of the reliability of the results. Sigma-z considers the original
classification structure, leaving it untouched, together with the distribution of the training datasets. An
overlaying structure statistically combines the two and also includes in the process the propagation of
feature uncertainties as a further element deriving from input measurements. Mohsen and Sadiq [61] proposed
a ranked voting strategy in place of classical voting, in which each tree is given a different weight based
on its accuracy. One-hot encoding has also been used as a representation method for the target of the random
forest, and this technique gave good results compared with the classical one [62]. The random forest
algorithm is chosen in this work for two major reasons: it is less prone to overfitting than a decision tree
and other algorithms, and it is able to indicate the significance of features. Overfitting becomes less
pronounced as the dataset grows, since a sufficient amount of data helps machine learning models find new
patterns efficiently.


4. THE PROPOSED WORK 

This work concentrates on modifying three critical points in the random forest algorithm: bootstrap with
less redundancy, an added feature selection method, and a modified ranking stage. Bootstrap is a crucial
stage in the random forest algorithm. In the modification, the bootstrap strategy is designed to decrease
the redundancy of samples, which increases their diversity. Algorithm 1 illustrates the essential steps of
bootstrap with less redundancy. This idea ensures a fair diversity of bootstrap samples, which leads to
different trees in the random forest.


Algorithm 1. Bootstrap with less redundancy

For i = 1 to No. of Samples Do
    Repeat
        Select sample k;
    Until the similarity between sample k and the others is greater than threshold
End For
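
One possible reading of Algorithm 1 is sketched below. Several points are assumptions rather than details
given in this work: the similarity measure (here 1 / (1 + Euclidean distance)), the direction of the test
(a candidate is accepted when its similarity to every sample already drawn does not exceed the threshold,
which is what reducing redundancy would require), and the `max_tries` cap that prevents an endless loop.
The function names are hypothetical.

```python
import math
import random

def similarity(a, b):
    """Assumed similarity in (0, 1]: 1.0 for identical rows, smaller as rows move apart."""
    return 1.0 / (1.0 + math.dist(a, b))

def low_redundancy_bootstrap(rows, n_samples, threshold=0.7, max_tries=50, seed=0):
    """Return indices of a bootstrap sample in which near-duplicate rows are avoided."""
    rng = random.Random(seed)
    chosen = []
    for _sample in range(n_samples):
        for _attempt in range(max_tries):
            k = rng.randrange(len(rows))
            # Accept k when it is not too similar to any sample already drawn.
            if all(similarity(rows[k], rows[j]) <= threshold for j in chosen):
                break
        # If no acceptable candidate was found, the last candidate is kept as a fallback.
        chosen.append(k)
    return chosen
```

With a threshold of 1.0 every candidate is accepted and the procedure reduces to the ordinary bootstrap;
lower thresholds trade sample size guarantees for diversity.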


Moreover, to increase the performance of the random forest algorithm in the null-value estimation problem,
the proposed modification concentrates on several steps. The feature selection step plays a significant
role in increasing the accuracy of the random forest algorithm. The proposed modification therefore makes
this step hybrid: the selected features depend on more than one feature selection method, combined as in
(1). With this equation, the random forest selects features according to two feature selection methods
weighted by w, so the selected features are more powerful and more relevant to the target.


$$\text{Hybrid Feature Selection} = w \cdot \text{Feature Selection}_1 + (w - 1) \cdot \text{Feature Selection}_2 \qquad (1)$$
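
The weighted combination in (1) can be illustrated with a short sketch. Equation (1) as printed uses the
factor (w - 1) for the second method; the sketch assumes the intended form is the convex combination with
(1 - w), which keeps both contributions non-negative for W between 0.4 and 0.6. It also assumes that both
methods return per-feature scores scaled to [0, 1]. The function names and the scores in the usage note
are hypothetical.

```python
def hybrid_scores(scores_a, scores_b, w=0.5):
    """Combine two per-feature score dictionaries into one hybrid score per feature.

    Assumption: both inputs are scaled to [0, 1] and share the same feature names.
    """
    return {f: w * scores_a[f] + (1.0 - w) * scores_b[f] for f in scores_a}

def select_top_features(scores, k):
    """Return the names of the k features with the highest hybrid score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For example, with made-up information gain scores `{"f1": 0.9, "f2": 0.2, "f3": 0.5}` and Gini index scores
`{"f1": 0.1, "f2": 0.8, "f3": 0.6}`, W = 0.5 ranks `f3` first, because it is the only feature that both
methods rate reasonably well.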


Another modification concerns the ranking strategy of the trees. The unmodified random forest algorithm
builds a set of n classification trees to predict the outcome from the predictors. Each tree is trained on a
different bootstrap sample of N subjects, with a random subset of mtry predictors considered at every node
of the tree. The random forest then aggregates tree-level results evenly across trees. In the proposed
approach, the traditional random forest algorithm is still used to construct the trees, but the aggregation
of trees is based on ranking. Specifically, the class votes of every tree in the forest are considered, and
the better-performing trees are given a higher rank. In other words, the ranking depends directly on
performance, and executing it on another matching dataset of a different size would lead to a biased
estimate of the prediction error rate. The data are originally split into training and testing sets during
the traditional execution of this algorithm in order to avoid bias while building trees on the bootstrap
samples. The predictive ability of each tree is calculated from its out-of-bag error. In this work, the
training data of the ranking random forest comprise three quarters of the full sample. Approximately one
half of the full sample is in-bag for each tree and is employed to construct the tree, and one quarter is
out-of-bag and is used to estimate tree performance, that is, to calculate tree accuracy. The tree accuracy
is thus calculated on the training data. The n trees are then used to obtain votes for the remaining
quarter, which serves as an independent test set, where the votes (predicted classifications) are aggregated
over trees using the ranking. Algorithm 2 illustrates the stages of the modified random forest algorithm, in
which the class prediction is based on the rank of every tree. The principal stages of this work are:

- Stage I: n random records are drawn from the dataset, which has k records. The samples are selected using
the proposed less-redundancy bootstrap.

- Stage II: A separate decision tree is created for each sample. Each tree has distinctive features
determined by the hybrid feature selection.

- Stage III: Each decision tree generates a result.

- Stage IV: The final result is evaluated by ranked voting for classification.
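
The data partition used for ranking (three quarters of the sample for training, of which roughly one half of
the full sample is in-bag and one quarter is out-of-bag for each tree, and the remaining quarter for
independent testing) can be sketched as follows. The sketch assumes a simple split without replacement and
hypothetical function names; how the in-bag portion is actually drawn is not specified here.

```python
import random

def partition_for_ranking(n_rows, seed=0):
    """Hold out one quarter of the row indices for testing; the rest is training data."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_test = n_rows // 4
    return idx[n_test:], idx[:n_test]  # (train, test)

def in_bag_out_of_bag(train, n_rows, rng):
    """Per tree: half of the full sample is in-bag, the rest of the training data is out-of-bag."""
    in_bag = rng.sample(train, n_rows // 2)
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in train if i not in in_bag_set]
    return in_bag, out_of_bag
```

Each tree would be built on its in-bag rows, scored on its out-of-bag rows to obtain the accuracy used for
ranking, and finally asked to vote on the held-out test quarter.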


Algorithm 2. Random forest modification
Begin
    For each tree in the random forest
        if (0.6 < Tree’s Accuracy < 0.80) Then
            add tree to predict list within 1 point;
        else if (0.8 < Tree’s Accuracy < 0.95) Then
            add tree to predict list within 2 points;
        else if (Tree’s Accuracy > 0.95) Then
            add tree to predict list within 3 points;
        end if
    end for
    predict = obtain more frequency tree in predict
    appropriate = match between predict all and actual
    RankVote = correct + no. of class in test data
End
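
The point scheme of Algorithm 2 can be read as a weighted vote, sketched below. The following are
assumptions, since the algorithm does not state them: accuracies exactly equal to 0.8 or 0.95 fall into the
lower band, trees with accuracy of 0.6 or less receive no points, and the predicted class is the one with
the highest point total. The function names are hypothetical.

```python
from collections import defaultdict

def tree_points(accuracy):
    """Map a tree's out-of-bag accuracy to its voting weight (1, 2 or 3 points)."""
    if accuracy > 0.95:
        return 3
    if accuracy > 0.8:
        return 2
    if accuracy > 0.6:
        return 1
    return 0  # assumed: weak trees do not vote

def ranked_vote(predictions, accuracies):
    """Return the class with the highest total of points over all trees."""
    totals = defaultdict(int)
    for label, acc in zip(predictions, accuracies):
        totals[label] += tree_points(acc)
    return max(totals, key=totals.get)
```

The difference from classical voting shows when a few accurate trees disagree with many weak ones: for
predictions `["R", "R", "M"]` with accuracies `[0.65, 0.70, 0.97]`, plain majority voting returns `"R"`,
whereas the ranked vote gives 2 points to `"R"` and 3 points to `"M"`.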


5. EXPERIMENTAL RESULTS 
5.1. Dataset description and parameters 

In this work, the proposed algorithm is executed on the five datasets shown in Table 1. The first dataset is
the Connectionist Bench (sonar, mines vs. rocks) dataset [63]. The task is to train a network to
discriminate between sonar signals reflected off a metal cylinder and those reflected off a roughly
cylindrical rock. This dataset includes two files. The first, "sonar.mines", consists of 111 patterns
obtained by bouncing sonar signals off a metal cylinder at different angles and under various conditions.
The second, "sonar.rocks", contains 97 patterns obtained from rocks under similar conditions. The
transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The dataset contains signals
collected at a variety of aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.
Every pattern consists of a set of 60 numbers in the range 0.0 to 1.0, and every number represents the
energy within a particular frequency band, integrated over a certain period of time. The integration
aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during
the chirp. The label associated with every record is (R) if the object is a rock and (M) if it is a mine
(metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode
the angle directly. The second dataset [64] consists of data collected from phishing websites, namely from
the PhishTank archive, the MillerSmiles archive, and Google's searching operators, while the third dataset
is Breast Cancer Wisconsin [65]. The fourth dataset is the Ionosphere dataset, a classification of radar
returns from the ionosphere [66]. Finally, the fifth dataset concerns the COVID-19 pandemic. The proposed
modified random forest algorithm for null-value imputation has several parameters, and Table 2 lists the
range of values of each. Four feature selection methods have been used in the experiments: Information
Gain, Gini Index, Chi-Squared and Correlation.


Table 1. Dataset description

Datasets                    Connectionist bench   Phishing websites   Breast cancer Wisconsin   Ionosphere      COVID-19
Data set characteristics    Multivariate          Multivariate        Multivariate              Multivariate    Multivariate
Attribute characteristics   Real                  Integer             Integer                   Integer, Real   N/A
Associated tasks            Classification        Classification      Classification            Classification  Classification
Number of instances         208                   1353                699                       351             14
Number of attributes        60                    10                  10                        34              oh
Null values?                N/A                   N/A                 Yes                       No              N/A
Area                        Physical              Computer            Life                      Physical        Computer
Date donated                N/A                   2016-11-02          1992-07-15                1989-01-01      2020-04-24
Number of web hits          116625                33498               389445                    166509          47691

Table 2. Parameter ranges

Parameter            Values    Meaning
No. of trees         10-30     Number of trees in the random forest.
Weight (W)           0.4-0.6   Weight for hybrid feature selection.
Threshold            0.6-0.8   Threshold value for bootstrap sampling.
No. of null values   1-3       Number of null values in each dataset row.

5.2. The effects and discussion 

Several experiments have been conducted to test the proposed algorithm within the parameter ranges in
Table 2. Three cases of null-value imputation have been considered, with 1, 2 and 3 values missing in each
row, respectively. In each case, the proposed algorithm has been tested with 10, 20 and 30 trees in the
random forest, with different threshold values (0.6, 0.7 and 0.8), and with different weights for hybrid
feature selection (W = 0.4, 0.5 and 0.6). The accuracy of null-value estimation is computed using (2):

$$\text{Null Value Accuracy} = \frac{\text{Actual Correct Null Values}}{\text{Desired Null Values}} \qquad (2)$$
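
The evaluation implied by (2) can be sketched as follows: a fixed number of values is blanked out in every
row, the values are imputed, and the share of blanked positions that were restored correctly is reported.
The sketch makes assumptions that are not stated in this work: positions are masked uniformly at random,
and a numeric estimate counts as correct when it lies within a tolerance `tol` of the true value (exact
match by default). The function names are hypothetical.

```python
import random

def mask_values(rows, nulls_per_row, seed=0):
    """Replace `nulls_per_row` randomly chosen values in each row with None."""
    rng = random.Random(seed)
    masked, positions = [], []
    for r, row in enumerate(rows):
        cols = rng.sample(range(len(row)), nulls_per_row)
        masked.append([None if c in cols else v for c, v in enumerate(row)])
        positions.extend((r, c) for c in cols)
    return masked, positions

def matches(actual, estimate, tol=0.0):
    """Numeric values match within `tol`; other values must be equal."""
    if isinstance(actual, (int, float)) and isinstance(estimate, (int, float)):
        return abs(actual - estimate) <= tol
    return actual == estimate

def null_value_accuracy(original, imputed, positions, tol=0.0):
    """Equation (2): correctly estimated null values divided by all null values."""
    correct = sum(1 for r, c in positions if matches(original[r][c], imputed[r][c], tol))
    return correct / len(positions)
```

Only the masked positions enter the ratio, so values that were never removed cannot inflate the accuracy.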

The experiments reported here use two feature selection methods (Information Gain and Gini Index) with
different weight values. Case I: one value missing in each row; the results are shown in Tables 3-5.
Case II: two values missing in each row; the results are shown in Tables 6-8. Case III: three values
missing in each row; the results are shown in Tables 9-11. The original random forest algorithm was also
run on the same dataset cases. Table 12 shows the best results of the proposed work compared with the
original random forest.


Table 3. One null-value accuracy within 10 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   15%    13%    14%    19%    21%    18%    16%    17%    15%
Phishing websites     67%    66%    66%    68%    69%    67%    68%    68%    66%
Breast cancer         85%    86%    85%    85%    87%    86%    84%    86%    85%
Ionosphere            32%    33%    33%    32%    34%    33%    32%    34%    33%
COVID-19              9%     9%     8%     9%     11%    12%    11%    13%    13%

Table 4. One null-value accuracy within 20 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   16%    18%    18%    21%    23%    19%    17%    16%    16%
Phishing websites     68%    67%    66%    69%    71%    70%    68%    69%    67%
Breast cancer         85%    87%    86%    89%    93%    87%    85%    86%    85%
Ionosphere            34%    34%    33%    35%    36%    34%    33%    34%    33%
COVID-19              11%    11%    10%    12%    11%    12%    10%    11%    10%

Table 5. One null-value accuracy within 30 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   16%    18%    18%    20%    20%    19%    17%    16%    16%
Phishing websites     68%    68%    66%    68%    69%    70%    67%    65%    66%
Breast cancer         86%    88%    85%    86%    89%    86%    84%    87%    84%
Ionosphere            36%    34%    35%    36%    39%    35%    34%    35%    33%
COVID-19              12%    11%    11%    12%    13%    11%    12%    11%    10%

Table 6. Two null-values accuracy within 10 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   13%    15%    14%    14%    16%    15%    13%    13%    14%
Phishing websites     2%     2%     1%     2%     3%     1%     2%     1%     1%
Breast cancer         77%    74%    77%    79%    78%    77%    78%    77%    78%
Ionosphere            15%    15%    14%    15%    16%    14%    14%    13%    13%
COVID-19              1%     1%     1%     1%     1%     1%     2%     1%     1%

Table 7. Two null-values accuracy within 20 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   15%    16%    15%    17%    17%    15%    14%    13%    14%
Phishing websites     2%     3%     1%     3%     4%     1%     3%     1%     1%
Breast cancer         83%    82%    83%    84%    86%    83%    84%    83%    83%
Ionosphere            15%    16%    15%    16%    17%    13%    14%    12%    11%
COVID-19              2%     2%     1%     2%     2%     3%     2%     1%     1%

Table 8. Two null-values accuracy within 30 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   15%    14%    15%    17%    17%    15%    15%    13%    14%
Phishing websites     2%     3%     1%     3%     4%     1%     3%     1%     1%
Breast cancer         82%    83%    83%    82%    85%    83%    83%    82%    82%
Ionosphere            15%    16%    15%    15%    18%    16%    15%    13%    11%
COVID-19              1%     1%     2%     2%     2%     2%     2%     1%     2%

The problem of null values is a complex one for several reasons: i) weak datasets, in which there are no
real associations among the attributes or features; ii) weak association of some null values with the
target or with the other complete attributes/features; iii) little complete data compared with the number
of null values; and iv) the nature of the dataset, for instance when there is no strong association or
relevance between the features and the target, or even among the features.


Table 9. Three null-values accuracy within 10 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   10%    11%    10%    10%    12%    12%    11%    10%    11%
Phishing websites     1%     1%     1%     2%     2%     1%     2%     1%     1%
Breast cancer         77%    78%    78%    77%    78%    78%    75%    76%    76%
Ionosphere            5%     4%     5%     6%     5%     3%     2%     2%     2%
COVID-19              2%     2%     2%     1%     1%     2%     1%     1%     1%


Table 10. Three null-values accuracy within 20 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   10%    12%    11%    11%    14%    12%    12%    10%    11%
Phishing websites     2%     3%     1%     3%     4%     1%     3%     1%     1%
Breast cancer         78%    78%    78%    71%    79%    78%    76%    77%    76%
Ionosphere            5%     5%     4%     6%     6%     4%     2%     3%     2%
COVID-19              2%     1%     1%     2%     3%     1%     2%     1%     1%

Table 11. Three null-values accuracy within 30 trees

Dataset               Threshold = 0.6      Threshold = 0.7      Threshold = 0.8
                      W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6  W=0.4  W=0.5  W=0.6
Connectionist bench   10%    12%    11%    11%    13%    12%    12%    10%    11%
Phishing websites     2%     3%     1%     3%     4%     1%     3%     1%     1%
Breast cancer         76%    78%    77%    76%    71%    78%    76%    76%    76%
Ionosphere            5%     6%     5%     6%     1%     3%     3%     2%     2%
COVID-19              2%     1%     1%     2%     3%     1%     3%     1%     1%


Table 12. Null-value accuracy using random forest and modified random forest

Dataset               No. of null values   Random forest   Modified random forest
Connectionist bench   1                    18%             23%
                      2                    11%             17%
                      3                    2%              4%
Phishing websites     1                    67%             71%
                      2                    1%              4%
                      3                    1%              4%
Breast cancer         1                    83%             93%
                      2                    77%             86%
                      3                    67%             79%
Ionosphere            1                    18%             36%
                      2                    9%              17%
                      3                    2%              6%
COVID-19              1                    15%             11%
                      2                    4%              2%
                      3                    1%              3%

No. of trees = 20, W = 0.5, Threshold = 0.7, feature selection methods: information gain and Gini index


For the above reasons, some results are unsatisfactory or do not meet expectations. In this work, the best
results have been obtained with 20 trees, W = 0.5, a threshold of 0.7, and the two feature selection
methods (information gain and Gini index). The performance of the modified random forest increased by
9.5%, 6.5% and 5.25% for 1, 2 and 3 null values, respectively. These figures are based on average values
for the five datasets. The nature of the dataset also plays a significant role in the accuracy of
null-value estimation. In addition, one-null-value imputation gave good results for the five datasets, two
null values gave lower accuracy than one, and three null values gave the lowest accuracy. The breast cancer
dataset gave the best results of the five. The Connectionist bench, Phishing websites, and Ionosphere
datasets gave inadequate results with two and three null values, while the performance on COVID-19 is not
satisfactory.


6. CONCLUDING REMARKS AND FUTURE DIRECTION

The modified random forest algorithm comprises three modifications that increase the performance of the
original algorithm: less-redundancy bootstrap, hybrid feature selection, and ranked voting. These
modifications make the random forest algorithm more efficient by selecting diverse samples through the
less-redundancy bootstrap and by using more than one feature selection method so that the selected features
are more relevant to the target. Lastly, the voting strategy is based on ranking the trees. Together, the
three modifications gave better results than the original algorithm. The experimental results for the five
datasets showed improvements of 9.5%, 6.5% and 5.25% for one, two, and three null values, respectively. In
the null-value imputation problem, increasing the number of missing values decreases the imputation
accuracy. The nature of the dataset also plays a significant role in the imputation: some datasets have
little relational relevance among their attributes, which leads to poorly learned rules, and such rules are
not sufficient to estimate the missing values. In the future, other machine learning techniques will be
applied to the problem of null values in the same datasets.